safety measure
Elon Musk's Grok 'Undressing' Problem Isn't Fixed
X has placed more restrictions on Grok's ability to generate explicit AI images, but tests show that the updates have created a patchwork of limitations that fail to fully address the issue. Elon Musk's X has introduced new restrictions stopping people from editing and generating images of real people in bikinis or other "revealing clothing." The change in policy on Wednesday night follows global outrage at Grok being used to generate thousands of harmful non-consensual "undressing" photos of women and sexualized images of apparent minors on X. However, while it appears that some safety measures have finally been introduced to Grok's image generation on X, the standalone Grok app and website seem to still be able to generate "undress" style images and pornographic content, according to multiple tests by researchers, WIRED, and other journalists. Other users, meanwhile, say they're no longer to create images and videos as they once were.
- North America > United States > Minnesota (0.05)
- North America > United States > California (0.05)
- Europe > United Kingdom (0.05)
- (12 more...)
- Information Technology > Security & Privacy (0.47)
- Media > News (0.35)
A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable generalization capabilities across multiple challenging distribution shifts. However, there is still much to be explored in terms of their robustness to the variations of specific visual factors. In real-world applications, reliable and safe systems must consider other safety measures beyond classification accuracy, such as predictive uncertainty. Yet, the effectiveness of CLIP models on such safety-related objectives is less-explored. Driven by the above, this work comprehensively investigates the safety measures of CLIP models, specifically focusing on three key properties: resilience to visual factor variations, calibrated uncertainty estimations, and the ability to detect anomalous inputs.
Engineers propose massive airbags for airplanes
The system uses an AI model that would trigger a Kevlar bubble cocoon in the event of a crash. 'REBIRTH is more than engineering--it's a response to grief,' the researchers wrote. Breakthroughs, discoveries, and DIY tips sent every weekday. An Air India flight from Ahmedabad bound for London spent just 30 seconds in the air before disaster struck earlier this year . Preliminary reports indicate that the aircraft's fuel control switches were inexplicably turned off shortly after takeoff, cutting fuel to the engines and causing total power loss.
- Asia > India (0.26)
- North America > United States > Vermont (0.05)
- North America > United States > New York (0.05)
- Europe > Ukraine (0.05)
- Transportation > Air (1.00)
- Transportation > Ground > Road (0.48)
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Korbak, Tomek, Balesni, Mikita, Barnes, Elizabeth, Bengio, Yoshua, Benton, Joe, Bloom, Joseph, Chen, Mark, Cooney, Alan, Dafoe, Allan, Dragan, Anca, Emmons, Scott, Evans, Owain, Farhi, David, Greenblatt, Ryan, Hendrycks, Dan, Hobbhahn, Marius, Hubinger, Evan, Irving, Geoffrey, Jenner, Erik, Kokotajlo, Daniel, Krakovna, Victoria, Legg, Shane, Lindner, David, Luan, David, Mądry, Aleksander, Michael, Julian, Nanda, Neel, Orr, Dave, Pachocki, Jakub, Perez, Ethan, Phuong, Mary, Roger, Fabien, Saxe, Joshua, Shlegeris, Buck, Soto, Martín, Steinberger, Eric, Wang, Jasmine, Zaremba, Wojciech, Baker, Bowen, Shah, Rohin, Mikulik, Vlad
AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
- North America > United States (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Information Technology > Security & Privacy (0.46)
- Government > Military (0.46)
Exclusive: New Claude Model Triggers Stricter Safeguards at Anthropic
This moment is a crucial test for Anthropic, a company that claims it can mitigate AI's dangers while still competing in the market. Claude is a direct competitor to ChatGPT, and brings in over 2 billion in annualized revenue. Anthropic argues that its RSP thus creates an economic incentive for itself to build safety measures in time, lest it lose customers as a result of being prevented from releasing new models. "We really don't want to impact customers," Kaplan told TIME earlier in May while Anthropic was finalizing its safety measures. "We're trying to be proactively prepared." But Anthropic's RSP--and similar commitments adopted by other AI companies--are all voluntary policies that could be changed or cast aside at will.
Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods
Grey, Markov, Segerie, Charbel-Raphaël
As frontier AI systems advance toward transformative capabilities, we need a parallel transformation in how we measure and evaluate these systems to ensure safety and inform governance. While benchmarks have been the primary method for estimating model capabilities, they often fail to establish true upper bounds or predict deployment behavior. This literature review consolidates the rapidly evolving field of AI safety evaluations, proposing a systematic taxonomy around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into frameworks. We show how evaluations go beyond benchmarks by measuring what models can do when pushed to the limit (capabilities), the behavioral tendencies exhibited by default (propensities), and whether our safety measures remain effective even when faced with subversive adversarial AI (control). These properties are measured through behavioral techniques like scaffolding, red teaming and supervised fine-tuning, alongside internal techniques such as representation analysis and mechanistic interpretability. We provide deeper explanations of some safety-critical capabilities like cybersecurity exploitation, deception, autonomous replication, and situational awareness, alongside concerning propensities like power-seeking and scheming. The review explores how these evaluation methods integrate into governance frameworks to translate results into concrete development decisions. We also highlight challenges to safety evaluations - proving absence of capabilities, potential model sandbagging, and incentives for "safetywashing" - while identifying promising research directions. By synthesizing scattered resources, this literature review aims to provide a central reference point for understanding AI safety evaluations.
- North America > United States (0.27)
- Europe > France (0.04)
- Oceania > Papua New Guinea (0.04)
- (3 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- (5 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (3 more...)
AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations
Bowen, Dillon, Dombrowski, Ann-Kathrin, Gleave, Adam, Cundy, Chris
The rapid advancement of AI systems has raised widespread concerns about potential harms of frontier AI systems and the need for responsible evaluation and oversight. In this position paper, we argue that frontier AI companies should report both pre- and post-mitigation safety evaluations to enable informed policy decisions. Evaluating models at both stages provides policymakers with essential evidence to regulate deployment, access, and safety standards. We show that relying on either in isolation can create a misleading picture of model safety. Our analysis of AI safety disclosures from leading frontier labs identifies three critical gaps: (1) companies rarely evaluate both pre- and post-mitigation versions, (2) evaluation methods lack standardization, and (3) reported results are often too vague to inform policy. To address these issues, we recommend mandatory disclosure of pre- and post-mitigation capabilities to approved government bodies, standardized evaluation methods, and minimum transparency requirements for public safety reporting. These ensure that policymakers and regulators can craft targeted safety measures, assess deployment risks, and scrutinize companies' safety claims effectively.
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
DiffGuard: Text-Based Safety Checker for Diffusion Models
Khader, Massine El, Bouzidi, Elias Al, Oumida, Abdellah, Sbaihi, Mohammed, Binard, Eliott, Poli, Jean-Philippe, Ouerdane, Wassila, Addad, Boussad, Kapusta, Katarzyna
Recent advances in Diffusion Models have enabled the generation of images from text, with powerful closed-source models like DALL-E and Midjourney leading the way. However, open-source alternatives, such as StabilityAI's Stable Diffusion, offer comparable capabilities. These open-source models, hosted on Hugging Face, come equipped with ethical filter protections designed to prevent the generation of explicit images. This paper reveals first their limitations and then presents a novel text-based safety filter that outperforms existing solutions. Our research is driven by the critical need to address the misuse of AI-generated content, especially in the context of information warfare. DiffGuard enhances filtering efficacy, achieving a performance that surpasses the best existing filters by over 14%.
- Media (0.68)
- Government > Military (0.48)
- Information Technology > Security & Privacy (0.46)
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Zhou, Yujun, Yang, Jingdong, Guo, Kehan, Chen, Pin-Yu, Gao, Tian, Geyer, Werner, Moniz, Nuno, Chawla, Nitesh V, Zhang, Xiangliang
Laboratory accidents pose significant risks to human life and property, underscoring the importance of robust safety protocols. Despite advancements in safety training, laboratory personnel may still unknowingly engage in unsafe practices. With the increasing reliance on large language models (LLMs) for guidance in various fields, including laboratory settings, there is a growing concern about their reliability in critical safety-related decision-making. Unlike trained human researchers, LLMs lack formal lab safety education, raising questions about their ability to provide safe and accurate guidance. Existing research on LLM trustworthiness primarily focuses on issues such as ethical compliance, truthfulness, and fairness but fails to fully cover safety-critical real-world applications, like lab safety. To address this gap, we propose the Laboratory Safety Benchmark (LabSafety Bench), a comprehensive evaluation framework based on a new taxonomy aligned with Occupational Safety and Health Administration (OSHA) protocols. This benchmark includes 765 multiple-choice questions verified by human experts, assessing LLMs and vision language models (VLMs) performance in lab safety contexts. Our evaluations demonstrate that while GPT-4o outperforms human participants, it is still prone to critical errors, highlighting the risks of relying on LLMs in safety-critical environments. Our findings emphasize the need for specialized benchmarks to accurately assess the trustworthiness of LLMs in real-world safety applications.
- North America > United States (1.00)
- Asia > India > Andhra Pradesh > Visakhapatnam (0.04)
A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable generalization capabilities across multiple challenging distribution shifts. However, there is still much to be explored in terms of their robustness to the variations of specific visual factors. In real-world applications, reliable and safe systems must consider other safety measures beyond classification accuracy, such as predictive uncertainty. Yet, the effectiveness of CLIP models on such safety-related objectives is less-explored. Driven by the above, this work comprehensively investigates the safety measures of CLIP models, specifically focusing on three key properties: resilience to visual factor variations, calibrated uncertainty estimations, and the ability to detect anomalous inputs.